[WIP] Area-weighted interpolation in Dask #180

Merged: 18 commits into pysal:main on Aug 18, 2023

Conversation

@darribas (Member) commented Aug 11, 2023

This PR adds functionality to run area-weighted interpolation (AWI) through Dask. The code heavily leverages dask-geopandas (learning a lot from their distributed spatial join) and makes it possible to run AWI out-of-core (i.e., larger-than-memory datasets), distributed (i.e., across different cores or machines), and in parallel (with a significantly more efficient algorithm than the current multi-core implementation).

A notebook (for now in the repository root; if this gets merged, it should be moved to notebooks) is provided that demonstrates how the method works, its correctness, and its performance.
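
For orientation, here is a minimal sketch of how this might be called end-to-end. The import path and the parameter names (id_col, categorical_variables) are assumptions based on the single-core API and the review snippets further down, not verbatim from this PR; the file names and column names are placeholders.

import geopandas
import dask_geopandas
from tobler.area_weighted import area_interpolate_dask

# Source and target polygons as regular GeoDataFrames
source = geopandas.read_file("source_polygons.gpkg")
target = geopandas.read_file("target_polygons.gpkg")

# Convert to dask-geopandas and spatially shuffle so that nearby geometries
# land in the same partition; this is the "sunk cost" step discussed below
dsource = dask_geopandas.from_geopandas(source, npartitions=16).spatial_shuffle(
    by="hilbert", shuffle="tasks"
)
dtarget = dask_geopandas.from_geopandas(target, npartitions=16).spatial_shuffle(
    by="hilbert", shuffle="tasks"
)

# Area-weighted interpolation of a categorical column (e.g., land use class);
# the result stays a lazy Dask collection until .compute() is called
out = area_interpolate_dask(
    dsource, dtarget, "target_id", categorical_variables=["land_use"]
)
result = out.compute()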

I'd also like to use this PR conversation to discuss whether you think this is the right place to contribute and if you'd modify anything on the API/location of the code.

NOTE: currently, only interpolation of categorical variables is implemented. My preference, unless someone can provide an implementation for extensive/intensive variables quickly, would be that this does not block merging (and possibly cutting a small release). Categoricals are important, for example, to interpolate rasters (e.g., land use), and having the functionality out in the wild would help it get tested. I need to think (and would welcome thoughts!) about how intensive/extensive should be done in this framework. I think it needs a second step of reweighting/interpolating when collecting chunks from the workers. For categorical, this step is simply adding up all the values for a given target geometry, but this is not the case for intensive/extensive.
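
To make that collection step concrete, here is a tiny pandas sketch (names and numbers made up) of what "adding up all the values for a given target geometry" amounts to once the per-worker chunks come back:

import pandas as pd

# Hypothetical partial results from two workers: values per
# (target geometry, category) for the rows each worker processed
chunk_a = pd.DataFrame(
    {"land_use_urban": [10.0, 0.0], "land_use_rural": [2.0, 5.0]},
    index=pd.Index(["t1", "t2"], name="target_id"),
)
chunk_b = pd.DataFrame(
    {"land_use_urban": [3.0], "land_use_rural": [1.0]},
    index=pd.Index(["t1"], name="target_id"),
)

# For categoricals, a target geometry split across partitions just has its
# partial values summed; no further reweighting is needed
out = pd.concat([chunk_a, chunk_b]).groupby(level="target_id").sum()
print(out)

For extensive/intensive variables this simple sum is not enough, which is exactly the second reweighting step mentioned above.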

Before merging, the following (at least) needs to be ticked off:

  • Fix the current "can't add float and str" bug that sometimes arises when many partitions are used (my hunch is that it comes from workers returning None rather than a full-fledged empty table, but it needs exploration). Edit: there's now an illustration of the bug in the notebook.
  • Add dask/daskgpd to the CI environment files to get the tests to execute
  • Add tests
  • Move demo notebook from source to notebooks folder
  • Explore AWI on extensive and intensive variables.

I'm happy to push through as the discussion evolves to add the remaining bits. I'd love to see this merged.

@darribas added the "enhancement (New feature or request)" and "WIP (Work in progress, do not merge)" labels on Aug 11, 2023
@darribas marked this pull request as draft on August 11, 2023 11:00
@knaaptime (Member) commented:

sweet, I'll take a look. A couple of quick reactions:

  • it would be good to do some profiling (including the sunk costs) against the in-core functions to give users a sense for when this actually provides a benefit
  • how heavy are dask/daskgpd? I assume we probably want those as optional dependencies
  • apart from the walkthrough, can you include a notebook with an actual use case (i.e. that shows what this is actually useful for)?

Categoricals are important, for example, to interpolate rasters (e.g., land use), and having the functionality out in the wild would help it get tested.

it would be useful to see whether this can provide a boost to the existing functionality we have for vectorizing rasters

@knaaptime (Member) commented:

(need to add dask/daskgpd to the ci environment files to get the tests to execute)

@darribas (Member, Author) replied:

it would be good to do some profiling (including the sunk costs) against the in-core functions to give users a sense for when this actually provides a benefit

The notebook already has a bit on this. With the baseline data (in the small hundreds of observations) it's clearly overkill but, as soon as you move to a few thousand, it seems worthwhile. The main catch with all this is that Dask is just a way of making better use of better hardware. The example there runs on a beefy machine with 16 cores; if you have a machine with two cores you won’t see much benefit other than out-of-core support (i.e., when your data don't fit in memory).

how heavy are dask/daskgpd? I assume we probably want those as optional dependencies

I'm not sure what you mean by "heavy"? I think they're mostly Python, but I'm not sure. Installation is straightforward on conda/mamba. I've added them to the environment.yml on the PR, and the included notebook runs on an environment created with that file. Dask is also widely embedded in the "stack" (e.g., xarray supports multi-core/out-of-core with it by default, scikit-learn can use it for grid search, etc.). For tobler I'd say, for now, an optional dependency would be the way to go.

apart from the walkthrough, can you include a notebook with an actual use case (i.e. that shows what this is actually useful for)?

I can certainly add another illustration of how this can be used, but I'm not sure it'd add much? This PR doesn't really add new functionality: tobler already had categorical support; this provides a Dask-based way of running it. I can do another notebook with an illustration, it's just that it'll be the same use case as with the single core. The benefit comes when you have tens/hundreds of thousands, or millions, of geometries to transfer, which I'm not sure I'd include as an example?

Categoricals are important, for example, to interpolate rasters (e.g., land use), and having the functionality out in the wild would help it get tested.

it would be useful to see whether this can provide a boost to the existing functionality we have for vectorizing rasters

It’s slightly different. We could think of a way of vectorizing pixels and doing a spatial dissolve with dask. I don’t know if that’d be faster (it'd be at least parallel/out-of-core), but it’s definitely different code (though similar philosophy), so I'd be tempted to leave that for a different PR, perhaps create an issue to remember this option in case we have bandwidth (or need) in the future to explore it.

@darribas (Member, Author) commented:

UPDATE - I think I've fixed the bug where sometimes computation would bomb on the final groupby

@knaaptime (Member) commented Aug 14, 2023

What I'd like to do is give users a sense of when they should reach for this versus the usual function. For that, it would be useful to have a more concrete example, maybe with an actual use case. I think it would be a real boon to folks if they understood what scale of dataset requires this function versus the normal one, and what actual questions you'd go about asking at that scale. The NYC taxi points thing Rocklin cooked up for the original dask-geopandas example was a nice example of this.

I have no doubt this is a useful addition to the codebase, but I just want to make sure we fully communicate that benefit to users.

it would be good to do some profiling (including the sunk costs) against the in-core functions to give users a sense for when this actually provides a benefit

The notebook already has a bit on this. With the baseline data (in the small hundreds of observations) it's clearly overkill but, as soon as you move to a few thousand, it seems worthwhile. The main catch with all this is that Dask is just a way of making better use of better hardware. The example there runs on a beefy machine with 16 cores; if you have a machine with two cores you won’t see much benefit other than out-of-core support (i.e., when your data don't fit in memory).

The notebook mentions:

NOTE - Timings below do not include the computation time required for spatial shuffling and partitioning (which can be substantial with large datasets), or for converting from geopandas. These are "sunk costs" that'll only make this approach preferable with large datasets, although they can be computed once and the result stored on disk efficiently (e.g., as Parquet files). Having said that, "large enough" is not very large in modern terms: from a few thousand observations the gains will be substantial if several cores/workers are available.

So it would be nice to see some examples that do include the time required for shuffling, and a sense of a real dataset that requires this kind of processing.

how heavy are dask/daskgpd? I assume we probably want those as optional dependencies

I'm not sure what you mean by "heavy"? I think they're mostly Python, but I'm not sure. Installation is straightforward on conda/mamba. I've added them to the environment.yml on the PR, and the included notebook runs on an environment created with that file. Dask is also widely embedded in the "stack" (e.g., xarray supports multi-core/out-of-core with it by default, scikit-learn can use it for grid search, etc.). For tobler I'd say, for now, an optional dependency would be the way to go.

yep, that was the question. Let's move these to function-level imports, like we treat numba in libpysal and bokeh in splot (or h3 here in tobler). We can leave them in the conda-forge recipe so folks have access by default, but move them out of module-level imports so they're not required.
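
As a sketch of that pattern (the signature below is simplified and the error message is illustrative; the imports mirror the ones flagged in the review comments further down):

def area_interpolate_dask(source_dgdf, target_dgdf, id_col, categorical_variables=None):
    # Import the optional dependencies only when the function is actually called,
    # so that `import tobler` keeps working without dask/dask-geopandas installed
    try:
        import dask_geopandas
        from dask.base import tokenize
        from dask.highlevelgraph import HighLevelGraph
    except ImportError as e:
        raise ImportError(
            "area_interpolate_dask requires `dask` and `dask_geopandas`. "
            "Install them (e.g., from conda-forge) to use this function."
        ) from e
    ...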

apart from the walkthrough, can you include a notebook with an actual use case (i.e. that shows what this is actually useful for)?

I can certainly add another illustration of how this can be used, but I'm not sure it'd add much? This PR doesn't really add new functionality: tobler already had categorical support; this provides a Dask-based way of running it. I can do another notebook with an illustration, it's just that it'll be the same use case as with the single core. The benefit comes when you have tens/hundreds of thousands, or millions, of geometries to transfer, which I'm not sure I'd include as an example?

What I mean is, one of the things the library struggles with is user documentation and demonstrating to folks how to actually apply our tools. The current notebook is really focused on showing that the Dask implementation matches the existing implementation, and that when you 40x a single dataset, the Dask scaling kicks in as it should. That's a really nice resource for us developers, but it's still kinda opaque for users trying to figure out which tool to lift off the shelf. I can imagine lots of folks wondering if their data is "big enough" for this to matter, so in an ideal world it would be nice to give them some orientation. Obviously we can't test datasets at a ton of different scales, but often a motivating use case is more insightful than an abstract one (and it sounded like maybe you had a land-use one on hand?). So that's what I was curious about.

I definitely wouldn't say this is a blocker, and I'm happy to merge even if we don't build out more examples, but I wanted to raise the issue while there's some energy behind this.

Categoricals are important, for example, to interpolate rasters (e.g., land use), and having the functionality out in the wild would help it get tested.
it would be useful to see whether this can provide a boost to the existing functionality we have for vectorizing rasters

It’s slightly different. We could think of a way of vectorizing pixels and doing a spatial dissolve with dask. I don’t know if that’d be faster (it'd be at least parallel/out-of-core), but it’s definitely different code (though similar philosophy), so I'd be tempted to leave that for a different PR, perhaps create an issue to remember this option in case we have bandwidth (or need) in the future to explore it.

Cool, that makes sense. Was just curious if this might be useful as a drop-in scaler for some other parts of the package.

@darribas (Member, Author) commented:

Too many threads here, all worth keeping track of:

  • I've spun off Explore Dask as a scaler #181 to keep track of the idea of using Dask elsewhere in the library
  • I think I see your point about the walkthrough for end-users. Do you think a good example would be moving NLCD to US tracts? If so, would you be game to get the notebook done together?
  • In that notebook, we can time the "sunk cost" of spatially partitioning the tables. I think it's unfair whichever way you do it: if you don't time it, you favor the Dask approach, because you need that step to use Dask; but if you do time it, it's really a pre-step that you may already have computed (a bit like including the time to build a W when you run a LISA... if you are comparing a LISA map with a choropleth, would you time the computation of the weights?). In any case, it is doable and we can include it (see the timing sketch after this list).
  • Cool on function-level imports (or maybe module-level, since all the Dask functionality is in a separate file right now?). Do you have an example of how we have done this elsewhere that I can copy here?
  • Finally, would you be game to think through the intensive/extensive cases a little to see whether they're low-hanging fruit we can pick in this PR or a longer-term ambition? If the former, it'd open the door, for example, to calculating NDVI on rasters (e.g., Landsat) and then transferring it to polygons (e.g., Census blocks) while keeping as much detail as possible. If so, maybe a one-hour call would be enough to brainstorm it?
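
For the timing bullet above, something along these lines would capture the sunk cost separately from the interpolation itself (a sketch; on a distributed cluster you would additionally wait on the persisted collection before stopping the clock):

import time
import dask_geopandas

def timed_spatial_shuffle(gdf, npartitions):
    # Time only the spatial partitioning "pre-step"
    start = time.perf_counter()
    dgdf = (
        dask_geopandas.from_geopandas(gdf, npartitions=npartitions)
        .spatial_shuffle(by="hilbert", shuffle="tasks")
        .persist()  # materialize the shuffled partitions (blocks with the local scheduler)
    )
    elapsed = time.perf_counter() - start
    print(f"spatial_shuffle into {npartitions} partitions: {elapsed:.2f}s")
    return dgdf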

@knaaptime (Member) commented:

Update:

@darribas and @knaaptime had a call on 8/15 to discuss expanding and integrating Dask-based interpolation. Summarizing, we decided this PR likely has the logic/bones necessary to implement intensive/extensive interpolation (in addition to categorical), but initial tests don't pass. We thought about refactoring slightly so that the distributed task only builds the correspondence table (not the interpolation/multiplication through the table), but in that case the final interpolation/aggregation after coming back from the workers looks very expensive. We've decided to move forward with the categorical option using Dask, with the idea that intensive/extensive should be on the menu and may be a good task for someone to pick up in the near future.
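
For the record, here is a rough pandas sketch of the driver-side step that refactor would imply for an extensive variable, once a correspondence table of intersection areas comes back from the workers (all names and numbers are made up; this aggregation is the part that looked expensive):

import pandas as pd

# Hypothetical correspondence table gathered from the workers: one row per
# intersecting (source, target) pair with the area of the intersection
corr = pd.DataFrame(
    {
        "source_id": ["s1", "s1", "s2"],
        "target_id": ["t1", "t2", "t1"],
        "intersection_area": [4.0, 1.0, 2.0],
    }
)
source_values = pd.Series({"s1": 100.0, "s2": 50.0})  # extensive variable (e.g., population)
source_areas = pd.Series({"s1": 5.0, "s2": 2.0})      # total area of each source polygon

# Extensive case: split each source value among targets in proportion to the
# share of the source's area falling in each target, then sum per target
corr["weight"] = corr["intersection_area"] / corr["source_id"].map(source_areas)
corr["allocated"] = corr["weight"] * corr["source_id"].map(source_values)
estimate = corr.groupby("target_id")["allocated"].sum()
print(estimate)  # t1 gets 80 + 50, t2 gets 20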

import dask_geopandas
import numpy as np
from dask.base import tokenize
from dask.highlevelgraph import HighLevelGraph
Member review comment on these imports:

We need to move these into a try/except inside the function (like with h3fy); otherwise they will force dask to be installed on import.

Member review comment:

(Levi also has a more clever way of doing this with decorators)

dask_geopandas.from_geopandas(sac2, npartitions=2)
.spatial_shuffle(by='hilbert', shuffle='tasks')
)
area = area_interpolate_dask.area_interpolate_dask(
Member review comment on this snippet:

Ah, the issue with the tests is that this is a function, not a method: do area_interpolate_dask(), not area_interpolate_dask.area_interpolate_dask().

Author reply:

Thanks! I've changed it and I'll see how the CI takes it!


codecov bot commented Aug 18, 2023

Codecov Report

Merging #180 (17841d3) into main (df0cbc6) will increase coverage by 3.12%.
The diff coverage is 98.18%.

@@            Coverage Diff             @@
##             main     #180      +/-   ##
==========================================
+ Coverage   51.39%   54.52%   +3.12%     
==========================================
  Files          14       15       +1     
  Lines         786      840      +54     
==========================================
+ Hits          404      458      +54     
  Misses        382      382              
Files Changed Coverage Δ
tobler/area_weighted/area_interpolate_dask.py 98.07% <98.07%> (ø)
tobler/area_weighted/__init__.py 100.00% <100.00%> (ø)
tobler/area_weighted/area_interpolate.py 94.52% <100.00%> (+0.72%) ⬆️

@darribas (Member, Author) commented:

Yahoo! tests passing!

@knaaptime marked this pull request as ready for review on August 18, 2023 15:21
@knaaptime merged commit 65a72aa into pysal:main on Aug 18, 2023
11 checks passed